Spatio-temporal Video Parsing for
Abnormality Detection 

Borislav Antic and Bjorn Ommer 


Abstract —Abnormality detection in video poses particular challenges due to the infinite size of the class of all irregular objects 
and behaviors. Thus no (or by far not enough) abnormal training samples are available and we need to find abnormalities 
in test data without actually knowing what they are. Nevertheless, the prevailing concept of the field is to directly search for
individual abnormal local patches or image regions independently of one another. To address this problem, we propose a method
for joint detection of abnormalities in videos by spatio-temporal video parsing. The goal of video parsing is to find a set of 
indispensable normal spatio-temporal object hypotheses that jointly explain all the foreground of a video, while, at the same 
time, being supported by normal training samples. Consequently, we avoid a direct detection of abnormalities and discover them 
indirectly as those hypotheses which are needed for covering the foreground without finding an explanation for themselves by 
normal samples. Abnormalities are localized by MAP inference in a graphical model and we solve it efficiently by formulating it 
as a convex optimization problem. We experimentally evaluate our approach on several challenging benchmark sets, improving 
over the state-of-the-art on all standard benchmarks both in terms of abnormality classification and localization. 

Index Terms —Abnormality Detection, Video Analysis, Surveillance, Video Retrieval, Graphical Models, MAP Inference 
1 Introduction

With the rapid growth of video data, there is an increasing 
need not only for recognition of objects and their behavior, 
but in particular for detecting the rare, interesting occurrences
of unusual objects or suspicious behavior in the
large body of ordinary data. Finding such abnormalities
in videos is crucial for applications ranging from automatic
quality control to visual surveillance. Due to the
large within-class variability, recognizing normal objects is 
already difficult. Abnormality detection in crowded scenes, 
however, features the additional challenge that there exist 
infinitely many ways for an object to appear in unusual 
context (irregular object instance) or to behave abnormally 
(unusual activity). Most of these abnormal instances are
unknown beforehand, since knowing them would, for instance, require
predicting all the ways somebody could cheat or break a
law. It is therefore simply impossible to learn a model for
all that is abnormal or irregular. Consequently, recent work
on abnormality detection [?] has focused on a setting where
the training data contains only normal visual patterns. Thus
a discriminative approach cannot be employed to directly
localize irregularities in these benchmark datasets. But
how can we find an abnormality without knowing what
to look for? In spite of this fundamental problem, the
main paradigm in abnormality detection is at present to
independently classify individual video patches [2], [3] or
regions [?].

If we want to avoid the ill-posed problem of having to 
decide locally and separately about the abnormality of each 
image region, we need to abandon the standard approach
of object detection, which aims at detecting all objects in a 
scene independently from one another. Since abnormality 
detection is typically concerned with videos from a static 
camera as in surveillance or industrial inspection, robust 
background subtraction algorithms [?] can be used for
foreground/background segregation. Our goal is then to 
find a set of spatio-temporal object hypotheses that jointly 
explain all foreground pixels. This means that normal object 
hypotheses, which can be learned from the training data, 
are spread over the spatio-temporal volume of a video in 
order to cover foreground pixels, while protruding into the 
background as little as possible. These hypotheses need 
to explain the appearance and behavior of the underlying 
video regions. As objects are mutually overlapping in 
crowded scenes, the spatio-temporal placement of the object 
hypotheses can only be determined jointly. Thus, our aim is
to simultaneously select those object hypotheses that are
necessary for explaining the foreground and to identify for
each selected hypothesis the best matching instance from 
the set of all normal training samples. Abnormal objects 
are then those hypotheses which are required for explaining 
the foreground, but which themselves cannot be explained 
by a normal training sample. Video parsing jointly infers 
all necessary object hypotheses, so that we can indirectly 
discover all abnormal objects present in a scene without 
actually knowing what to look for. 

Our video parsing approach consists of two stages. In the 
first phase, we detect a large number of object candidates 
in each video frame and then group them temporally 
into spatio-temporal object hypotheses. This shortlist of 
hypotheses is a superset of all candidates that might be 
eventually needed for parsing the video, i.e., it has a low 
false negative and high false positive rate. The object 
candidates in individual frames are obtained by running
a discriminative background classifier and keeping only 
those patterns which are very unlikely to be background. 
Subsequently, object candidates in individual frames are
linked temporally according to their motion cues so as to
establish the shortlist of spatio-temporal object hypotheses.
In the second phase of video parsing, the goal is
to select hypotheses from the shortlist that can explain 
the foreground, and to simultaneously find normal object 
instances that match those hypotheses. We formulate this 
as an inference problem in a graphical model whose goal is 
to maximize the probability of the foreground explanation 
in a video. The inference in the graphical model is cast 
as a convex optimization problem where the unknown
variables indicate both the selection of hypotheses from
the shortlist and their corresponding normal object prototypes
learned from the training videos. Correspondences
between hypotheses and normal object prototypes are based
upon their shape, location as well as their appearance and
behavior. The probability of abnormality of each hypothesis
necessary for explaining the foreground is then calculated
using the results of inference. Besides identifying abnormal
objects, video parsing also computes per-pixel probability 
of abnormality, which effectively segments abnormalities 
without having any training samples for them. 

We evaluate our approach on novel benchmark datasets 
for abnormality detection that feature highly crowded 
scenes. As an example, the UCSD ped1 and ped2 anomaly
detection and localization datasets [?] contain busy walkways
teeming with walking pedestrians. Abnormalities are
not staged, but they occur spontaneously and correspond 
to unusual objects (e.g., vehicles in a pedestrian zone) or 
behaviors (e.g., a person cycling across walkways) in the 
scene. The training data features only normal patterns with 
large intra-class variability, whereas the test set consists of 
normal and abnormal instances. Due to the small resolution 
of videos (a person in the scene is on average only 20 
pixels tall) and heavy occlusion between objects in the 
scene, learning models of visual patterns is difficult. We 
also increase the future utility of the UCSD ped1 dataset
by completing the pixel-wise ground-truth annotation for 
all videos in the test set that previously existed only for a 
small subset. The experimental results show a significant 
performance gain of our spatio-temporal video parsing 
approach in comparison to other state-of-the-art methods 
for abnormality detection. 

2 Related Work 

We discuss here the previous work on abnormality detection 
in videos. The related problem of object recognition and 
tracking in crowded scenes [?], [?] aims at recognizing and
tracking objects of a known class in a scene, whereas our
goal is to detect abnormal objects, all of them being instances
of an unknown class. Therefore, object recognition
and tracking are beyond the scope of this paper and the
details on these topics can be found in [?]. The majority of the
work on abnormality detection relies on the extraction of
semi-local features from video [?]
that are then used to train a normalcy model. Abnormalities
are detected if the normalcy model does not fit the
data. Some approaches [?] are based on manually
specifying constraints that define the condition of normalcy,
whereas other methods [?] learn
the normalcy model directly from data in an unsupervised way.

The approach of Adam et al. [20] focuses on individual
activities occurring only in selected parts of a scene. Kim
and Grauman [?] detect abnormalities using a spatio-
temporal Markov random field that adapts to abnormal
activities in videos. Toy et al. [?] use active learning
methodology to integrate human feedback into the detection
of abnormal events and behaviors. Unsupervised topic
models are used for detection of abnormal behaviors in
[23], [24]. Hospedales et al. [25] propose a semi-supervised
multi-class topic model to classify and localize the subtle
behavior in cluttered videos. Mahadevan et al. [?] detect
unusual objects in crowded scenes by jointly modeling
the dynamics and appearance with mixtures of dynamic
textures. Li et al. [?] use the mixture of dynamic textures
at multiple scales to detect abnormalities in a conditional
random field framework.

Kratz and Nishino [?] develop a statistical model of
local motion patterns in very crowded scenes to find
abnormalities as local volumes with a large motion variation.
Benezeth et al. [28] use low-level features to learn
the co-occurrence matrix of normal behavior, and apply a
Markov random field to find deviating behaviors. Cong
et al. [29] use sparse reconstruction cost implemented on
a normal dictionary of local spatio-temporal patches to
detect local and global abnormalities. Saligrama et al. [30]
propose optimal decision rules for detecting local spatio-
temporal abnormalities. An efficient sparse combination
learning framework that achieves decent performance in the
detection phase is proposed by Lu et al. [?].

Instead of independently detecting abnormal regions in
video as in other approaches, abnormalities are discovered
indirectly after establishing a set of spatio-temporal
hypotheses that provide a complete explanation of the foreground.
Previous approaches related to scene parsing differ
in that a parametric scene [32], [?] or object model [?],
[?], [36] or a non-parametric exemplar-based representation
for objects [?], [38] can be constructed. In contrast
to these methods we are not provided any training samples
for the abnormalities we are searching for but we can
leverage a foreground/background segregation. In contrast
to our previous sequential video parsing [?] that parsed
video frames only spatially, one after another, the approach 
proposed in this paper performs a joint spatio-temporal 
parsing of video frames. This methodological extension is 
used to resolve both the spatial and temporal dependencies 
between objects in a scene. The new convex formulation 
of the inference process that improves upon the previous 
locally optimal inference method allows us to efficiently 
aggregate evidence from different frames and decide about 
their abnormalities in a globally optimal manner. 




Fig. 1. Successive stages of the video parsing: (a) Source frame of a video, (b) Foreground probability map 
that needs to be explained by video parsing, (c) Object candidates found by inverted background detector, (d) 
Spatio-temporal object hypotheses found by temporal grouping serve as an input to the video parsing, (e) Subset 
of spatio-temporal object hypotheses that is selected by video parsing to explain the foreground pixels, (f) Normal 
object prototypes found by video parsing to explain the selected object hypotheses. Best viewed in color. 


3 Model for Spatio-temporal Video 
Parsing 

In case of a stationary camera, the foreground/background
segregation becomes feasible due to background subtraction.
The foreground mask then makes it possible to turn
the abnormality detection problem into a task of video 
parsing. The goal is thus to explain all the foreground 
of a video using object hypotheses and to explain each 
hypothesis by an object model learned from the set of 
normal training videos. The underlying statistical inference 
problem has to be tackled jointly for all hypotheses, since 
hypotheses can explain each other away. Abnormalities 
are then those hypotheses that are required to explain the 
foreground but which themselves cannot be explained by 
any prototype from the normal object model. 

Foreground segmentation. Scenarios for abnormality 
detection often involve the analysis of videos from static 
cameras. Background in such videos is constant or changes 
slowly over time, hence it can be learned effectively from 
a video. The resulting background model can then be 
applied to find all foreground pixels in the video. The final 
foreground/background segmentation is represented by a 
binary variable $f_j^t \in \{0,1\}$ for all pixels $j$ in frame $t$.

Background subtraction assumes that each frame $I^t$ of a
video can be expressed as the background model $B^t$ plus
a sparse vector $I^t - B^t$ whose nonzero elements are the
foreground pixels. After stacking successive video frames
as columns in a matrix $I = [I^{t-\tau} \cdots I^t]$, we want to find
the low-rank background model $B$ such that the sparsity
inducing norm of the difference $I - B$ is the smallest
possible. Following the approach of Wright et al. [?], we
approximate the rank of the matrix $B$ by the nuclear norm¹
$\|\cdot\|_*$ and use $\|\cdot\|_1$ as the sparsity inducing norm, so that
the background subtraction becomes the following convex
optimization problem,

$$ B = \operatorname*{argmin}_B \; \|B\|_* + \|I - B\|_1. \tag{1} $$

Now that we calculated the background model $B$, it
can be used to find all foreground pixels $j$, $f_j^t = 1$, as
those that have a large discrepancy between the observation
$I_j^t$ and the background model $B_j^t$. The probability that a
pixel is foreground, $P(f_j^t = 1)$, is obtained by the sigmoid
transformation of the difference of the pixel's intensity and
the background model,

$$ P(f_j^t = 1) = 2\,\bigl(1 + \exp(-\lambda \|I_j^t - B_j^t\|)\bigr)^{-1} - 1. \tag{2} $$

Pixels with foreground probability greater than 0.5 are considered
as foreground, $f_j^t = 1$, and others as background,
$f_j^t = 0$.
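The thresholding step of Eq. (2) can be illustrated with a minimal Python sketch. It assumes scalar grayscale intensities (so the norm reduces to an absolute difference) and an already computed background image; the value of `lam` and the function names are hypothetical choices for illustration, not values given in the text.

```python
import math

def foreground_probability(intensity, background, lam=0.1):
    """Eq. (2) for one grayscale pixel: rescaled sigmoid of |I - B|.

    lam is an assumed value; the text does not state lambda.
    """
    diff = abs(intensity - background)
    return 2.0 / (1.0 + math.exp(-lam * diff)) - 1.0

def foreground_mask(frame, background, lam=0.1, threshold=0.5):
    """Binary labels f_j^t: 1 where the foreground probability exceeds 0.5."""
    return [
        [1 if foreground_probability(i, b, lam) > threshold else 0
         for i, b in zip(frame_row, background_row)]
        for frame_row, background_row in zip(frame, background)
    ]
```

With these assumptions a pixel identical to the background gets probability 0, and the probability approaches 1 as the discrepancy grows.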

1. Nuclear norm is the sum of the singular values of the matrix and is 
a convex function. 
Fig. 2. Probabilistic graphical model of the spatio-
temporal video parsing. The left plate contains all
spatio-temporal hypotheses $h$ with their descriptors $d_h$
and locations $l_h$. The right plate comprises all pixels
$j$ with their foreground labels $f_j^t \in \{0,1\}$. By video
parsing, we infer the set of hypotheses, $o_h \in \{0,1\}$,
that are necessary for explaining the foreground, and
jointly explain the selected hypotheses by the normal
object prototypes $m_h \in \{1, \dots, K\}$. Finally, for each
selected hypothesis $h$ we decide if it is abnormal,
$a_h \in \{0,1\}$, and also mark foreground pixels that
belong to abnormal objects, $a_j^t \in \{0,1\}$.


Shortlist of Object Hypotheses. For parsing the video, 
we need to specify a list of spatio-temporal object hypotheses
that is sufficient for explaining foreground pixels in
video. An input to our video parsing algorithm consists 
of the most suitable object hypotheses for the task of 
foreground explanation. In Sect. we explain the procedure 
for creating a shortlist of object hypotheses that has a 
high recall, i.e. where the majority of true-positive object 
hypotheses is included in the shortlist. However, as the 
precision rate of the proposed shortlist is low, there will be 
many superfluous hypotheses that are then explained away 
by others during video parsing. 

We assume that hypotheses from the shortlist span a
time window $\{t-\tau, \dots, t\}$. Each hypothesis $h$ represents a
spatio-temporal tube covering locations $l_h := (l_h^{t-\tau}, \dots, l_h^t)$.
This is a trajectory of locations $l_h^t = (x_h^t, y_h^t, s_h^t)$, which
specify the center $(x_h^t, y_h^t)$ and the scale $s_h^t$ of a candidate
object $h$ at time $t$. The scale of an object represents its size
relative to the size $(W, H)$ of the object model. The support
region of an object hypothesis $h$ at time $t$ is the bounding
box of size $(s_h^t W, s_h^t H)$ and the set of all pixels $j$ that
belong to it is denoted by $S_h^t$.

The goal of video parsing is then to select a subset 
from the shortlist of hypotheses that is both necessary and 
sufficient for explaining the foreground of a test video 
while, at the same time, finding normal object prototypes that
explain the hypotheses of the subset (see Fig. [?]).

Spatio-temporal object descriptor. A spatio-temporal
hypothesis $h$ matches its corresponding normal object prototype
both in appearance and motion. Thus, we need a
spatio-temporal descriptor $d_h$ to capture the essence of
both appearance and motion of hypothesis $h$. We build
a spatio-temporal descriptor $d_h := (d_h^{t-\tau}, \dots, d_h^t)$ by
concatenating frame-wise descriptors $d_h^t$ calculated at each
time $t$. Frame-wise object appearance is represented by
the spatial derivatives of the pixel intensities in the support
region of hypothesis $h$. Analogously, object motion is
represented by the temporal derivatives of the pixel intensities.
The appearance and motion representations are combined
into the frame-wise descriptor,

$$ d_h^t := \left( \frac{\partial I_j^t}{\partial x}, \frac{\partial I_j^t}{\partial y}, \frac{\partial I_j^t}{\partial t} \right)_{j \in S_h^t}. \tag{3} $$

Fig. 3. The normal object model consists of a set
of spatio-temporal shape prototypes, each being a
sequence that captures the temporal evolution of a
particular shape. Prototypes are accompanied by the
appearance and motion descriptors.


Since the spatio-temporal descriptor $d_h$ is long and redundant,
we build its compact representation by applying a PCA
transformation that projects onto an eigenspace such that most
of the signal variation is preserved (about 95%).
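As a rough illustration of the frame-wise descriptor in Eq. (3), the following Python sketch approximates the three derivatives with forward and backward finite differences. The differencing scheme, the fixed axis-aligned box, and all function names are assumptions made for this example; the PCA compression step is omitted.

```python
def frame_descriptor(prev_frame, frame, box):
    """Finite-difference sketch of Eq. (3) for one frame.

    box = (x0, y0, x1, y1) is a half-open support region; it must leave
    one pixel of margin on the right and bottom for forward differences.
    """
    x0, y0, x1, y1 = box
    descriptor = []
    for y in range(y0, y1):
        for x in range(x0, x1):
            dx = frame[y][x + 1] - frame[y][x]      # spatial derivative in x
            dy = frame[y + 1][x] - frame[y][x]      # spatial derivative in y
            dt = frame[y][x] - prev_frame[y][x]     # temporal derivative
            descriptor.extend([dx, dy, dt])
    return descriptor

def tube_descriptor(frames, box):
    """Concatenate frame-wise descriptors over a window of frames.

    Assumes, for simplicity, that the support region does not move.
    """
    descriptor = []
    for prev_frame, frame in zip(frames, frames[1:]):
        descriptor.extend(frame_descriptor(prev_frame, frame, box))
    return descriptor
```

A real tube would use a different support region $S_h^t$ in every frame and rescale it to the model size before concatenation.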

Activating hypotheses needed for parsing. Not all
object hypotheses from the shortlist are needed to explain
foreground pixels in video. Video parsing retains
only the indispensable hypotheses that cannot be explained
away by other hypotheses. Therefore, we use an indicator
variable $o_h \in \{0,1\}$ for each hypothesis $h$ to designate
the hypothesis as active/inactive. To initialize parsing, a
discriminative classifier is trained to distinguish background
spatio-temporal patterns from anything else. This background
classifier computes the probability that hypothesis
$h$ is background, $P(o_h = 0 \mid d_h)$, which is then inverted to
obtain the foreground probability. A hypothesis with high
foreground probability can still become inactive if it gets 
explained away by others during video parsing. 

Matching with the object model. Video parsing jointly
explains foreground pixels with object hypotheses, and
active hypotheses $\{h : o_h = 1\}$ with normal object
prototypes learned from the training data. The object model
consists of $K$ normal object prototypes that represent the
diversity of normal object shape, appearance, and motion.
Video parsing then determines for each selected hypothesis
$h$ which of the $K$ prototypes best explains it. The prototype
that video parsing associates with hypothesis $h$ is indicated
by the variable $m_h \in \{1, \dots, K\}$. Sect. 5 explains in detail
the learning of the normal object prototypes. For the time
being, we assume that $K$ normal object prototypes are
provided as input to the parsing algorithm.

Fig. 4. (a) Spatio-temporal tubes illustrate the hypotheses
selected by video parsing. Normal shape
contours that explain the hypotheses are shown overlaid.
(b) Superfluous hypotheses are eliminated by the
statistical inference of explaining away. The idea is
the following: Object hypothesis (yellow) is used at
the beginning of video parsing to explain the foreground
pixel in the middle. Other object hypotheses
(red and blue) are introduced later to explain the top
and bottom pixels. However, the pixel in the middle is
also explained by new hypotheses, so that the original
(yellow) hypothesis is not needed anymore and it can
be eliminated.

For each hypothesis $h$ the best prototype $m_h \in
\{1, \dots, K\}$ from the learned object model is sought (Fig.
[?]). For abnormal objects all prototypes will obviously have
high matching costs. Consequently, the probability that
prototype $m_h$ is matched to a hypothesis $h$ in a query
video depends on how similar they are in both appearance
and motion, $\Delta(d_h, d_{m_h})$. Here, $\Delta$ denotes a function that
measures the distance of spatio-temporal descriptors in
the corresponding feature space. Given the spatio-temporal
descriptor $d_h$ of hypothesis $h$, the probability of matching
prototype $m_h$ with the hypothesis $h$ is the Gibbs distribution,

$$ P(m_h \mid d_h) = \frac{1}{Z(d_h)} \exp\bigl(-\beta \Delta(d_h, d_{m_h})\bigr), \tag{4} $$

where $Z(d_h)$ is the partition function used to normalize the
probability distribution.
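The Gibbs distribution of Eq. (4) amounts to a softmax over negative descriptor distances. The sketch below assumes that $\Delta$ is the Euclidean distance and that `beta` equals 1; both are illustrative assumptions, since the text leaves the distance function and $\beta$ unspecified.

```python
import math

def matching_probabilities(d_h, prototype_descriptors, beta=1.0):
    """Eq. (4): P(m_h = k | d_h) for every prototype k.

    Assumes Delta is the Euclidean distance (math.dist) and a
    hypothetical inverse temperature beta.
    """
    distances = [math.dist(d_h, d_k) for d_k in prototype_descriptors]
    # Subtracting the smallest distance leaves the normalized result
    # unchanged but avoids underflow when all distances are large.
    smallest = min(distances)
    weights = [math.exp(-beta * (dist - smallest)) for dist in distances]
    partition = sum(weights)  # plays the role of Z(d_h)
    return [w / partition for w in weights]
```

For an abnormal hypothesis all distances are large; the normalized probabilities alone do not reveal this, which is why the unnormalized matching cost also matters in practice.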



Fig. 5. The distribution of locations of normal object
prototypes estimated by the Parzen windows at multiple
scales (represented as horizontal slices).


Moreover, normal objects typically occupy some locations
in a scene more often than others, and also tend to move at
a certain speed. For example, cars are more likely to drive
on roads than on sidewalks, whereas pedestrians are more
likely to walk on sidewalks. Consequently, the probability
of observing hypothesis $h$ that matches the prototype $m_h$
depends on its location and velocity,

$$ P(l_h \mid m_h) \propto p^{\mathrm{loc}}_{m_h}(l_h^t) \cdot p^{\mathrm{vel}}_{m_h}(l_h^t - l_h^{t-\tau}). \tag{5} $$

The normal location and velocity distributions $p^{\mathrm{loc}}$ and
$p^{\mathrm{vel}}$ are learned for each of the $K$ object prototypes using
the Parzen window density estimator (see Fig. 5).
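A Parzen window estimate of a location density can be sketched in a few lines of Python. The isotropic Gaussian kernel, the bandwidth value, and the restriction to 2-D image positions (ignoring scale) are assumptions of this example rather than choices stated in the text.

```python
import math

def parzen_density(query, samples, bandwidth=1.0):
    """Gaussian Parzen-window estimate of a 2-D density at `query`.

    samples: observed (x, y) positions of one normal object prototype.
    bandwidth: assumed kernel width (standard deviation in pixels).
    """
    normalizer = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(samples))
    total = 0.0
    for sx, sy in samples:
        squared_distance = (query[0] - sx) ** 2 + (query[1] - sy) ** 2
        total += math.exp(-squared_distance / (2.0 * bandwidth ** 2))
    return normalizer * total
```

The same estimator could be applied to displacement vectors to obtain a velocity density.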

Therefore, the probability that hypothesis $h$ matches
the normal object prototype $m_h$ is

$$ P(m_h \mid o_h, d_h, l_h) \propto o_h \cdot P(m_h \mid d_h) \cdot P(l_h \mid m_h). \tag{6} $$

Explaining foreground pixels. Video parsing selects
hypotheses, $\{h : o_h = 1\}$, and finds corresponding
normal object prototypes $m_h$ to explain the foreground.
The foreground probability of a pixel $j$ depends on all
hypotheses $h$ that overlap with pixel $j$. Given the support
regions of all hypotheses $h$, $\{h : j \in S_h^t\}$ is the set
of hypotheses that cover the pixel $j$. The probability that
pixel $j$ is background is equal to the product of the pixel's
background probabilities for each single hypothesis $h$ that
contains the pixel $j$. Even if all hypotheses claim that
pixel $j$ is background, $P(f_j^t = 1 \mid o_h, m_h, l_h) = 0$, $\forall h$,
we still allow it to be foreground with a small foreground
probability $P_0 > 0$. Thus, the foreground probability of pixel
$j$ given all hypotheses is

$$ P(f_j^t = 1 \mid \{o_h, m_h, l_h\}_h) = 1 - (1 - P_0) \prod_h \bigl(1 - P(f_j^t = 1 \mid o_h, m_h, l_h)\bigr). \tag{7} $$
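Eq. (7) has the form of a noisy-OR combination, which the following Python sketch makes explicit. The default value of `p0` is an assumed placeholder for the small leak probability $P_0$, which the text does not quantify.

```python
def pixel_foreground_probability(per_hypothesis_probs, p0=0.01):
    """Eq. (7): combine per-hypothesis foreground probabilities of one pixel.

    per_hypothesis_probs: values P(f_j^t = 1 | o_h, m_h, l_h) for all
    hypotheses h covering the pixel (inactive ones contribute 0).
    p0: assumed leak probability P_0 > 0.
    """
    background = 1.0 - p0
    for p in per_hypothesis_probs:
        background *= 1.0 - p   # pixel is background under every hypothesis
    return 1.0 - background
```

Because the background probability is a product, a single confident hypothesis suffices to explain a foreground pixel, which is what lets overlapping hypotheses explain each other away.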

The foreground probability given a single hypothesis $h$,
$P(f_j^t = 1 \mid o_h, m_h, l_h)$, depends on the shape of the corresponding
normal object prototype $m_h$. In the training data,
the prototype $m_h$ covers pixels $j'$ with some probability
$P_{m_h}(f_{j'} = 1)$. Thus, the foreground probability of pixel $j$
under hypothesis $h$ is obtained by taking its corresponding
object prototype $m_h$ and "pasting" the foreground probability
of $m_h$ at the location of $h$. The model now needs
to be brought into the reference frame of $h$ by scaling
and translating it, i.e. $l_j = s_h^t \cdot l_{j'} + (x_h^t, y_h^t)^\top$. Then the
foreground probability of pixel $j$ given $h$ becomes

$$ P(f_j^t = 1 \mid o_h, m_h, l_h) = o_h \cdot 1[j \in S_h^t] \cdot \sum_{j'} 1\bigl[l_j = s_h^t \cdot l_{j'} + (x_h^t, y_h^t)^\top\bigr] \cdot P_{m_h}(f_{j'} = 1). \tag{8} $$

Here $1[\cdot]$ denotes the indicator function. In Eq. 8 the
foreground probability of pixel $j$ is set to zero if hypothesis
$h$ is inactive, $o_h = 0$, or the pixel $j$ does not belong to the
support region of hypothesis $h$, $j \notin S_h^t$.


4 Inference by Foreground Parsing 

The goal is now to estimate which of the hypotheses are 
actually needed for explaining the foreground and to find a 
matching normal object prototype for each hypothesis. For
abnormal hypotheses Eq. [?] will yield low probabilities. If
foreground $f_j^t = 1$ is observed and the pixel is covered by
a hypothesis $h$, and no other hypothesis can be found that
could explain the presence of the foreground at that pixel,
then the probability of activation of hypothesis $h$ increases.
This leads to the statistical inference of explaining away.
For an observed variable $f_j^t$ different hypotheses $h$ that
share the same pixel $j$ become statistically dependent so
that the absence of one hypothesis can dictate the presence
of another (see Fig. 4).


4.1 Joint Inference by MAP

Based on the foreground segmentation mask $f_j^t$ and the
shortlist of hypotheses $h$ with spatio-temporal descriptors
$d_h$ and trajectories $l_h$, we need to jointly infer all hidden
variables $\{o_h, m_h\}_h$ in our graphical model (Fig. 2). Following
a maximum a posteriori (MAP) approach yields a
set of hypotheses that best explain the foreground and are
themselves explained by the normal object prototypes,

$$ \{\hat{o}_h, \hat{m}_h\}_h = \operatorname*{argmax}_{\{o_h, m_h\}_h} P\bigl(\{o_h, m_h\}_h \mid \{d_h, l_h\}_h, \{f_j^t\}_j\bigr) $$

$$ \propto \prod_j P\bigl(f_j^t \mid \{o_h, m_h, l_h\}_h\bigr) \prod_h P(o_h \mid d_h) \, P(m_h \mid o_h, d_h, l_h). \tag{9} $$

Instead of explicitly maximizing the posterior probability,
we take the negative logarithm of Eq. 9 and thereby obtain the
energy function $J(\cdot)$ which is then minimized. Furthermore,
we decompose the energy function $J(\cdot)$ into two terms,
$J_f(\cdot)$ covering the explanation of foreground pixels $j$, and
$J_h(\cdot)$, which involves the explanation of hypotheses $h$ by
the normal object prototypes,

$$ J(\{o_h, m_h\}_h) := \underbrace{-\sum_j \log P\bigl(f_j^t \mid \{o_h, m_h, l_h\}_h\bigr)}_{=:\, J_f(\{o_h, m_h\}_h)} \; \underbrace{- \sum_h \bigl( \log P(o_h \mid d_h) + \log P(m_h \mid o_h, d_h, l_h) \bigr)}_{=:\, J_h(\{o_h, m_h\}_h)}. \tag{10} $$

To find the MAP solution, we introduce a parsing indicator
$z_{h,k} \in \{0,1\}$ that equals one if hypothesis $h$ is active,
$o_h = 1$, and its corresponding normal object prototype is
$m_h = k$,

$$ z_{h,k} := o_h \cdot 1[m_h = k], \quad \forall h, \; \forall k \in \{1, \dots, K\}. \tag{11} $$

To keep the notation simple, let the vector $z_h :=
(z_{h,1}, \dots, z_{h,K})^\top$ denote the parsing indicators of hypothesis
$h$, and the vector $z := \{z_h\}_h$ denote the parsing
indicators of all hypotheses together. The following lemma
now states that the hypotheses explanation $J_h(\cdot)$ can be
expressed as a linear function of the parsing indicator $z$.

Lemma 4.1: The hypotheses explanation term
$J_h(\{o_h, m_h\}_h)$ in Eq. 10 is a linear function of the
parsing indicator $z$, i.e.

$$ J_h(\{o_h, m_h\}_h) = b^\top z + b_0, \tag{12} $$

where the parameter vector $b = \{b_{h,k}\}_{h,k}$ and scalar $b_0$ do
not depend on the parsing indicator $z$. The proof of Lemma
4.1 is given in the Appendix.

To express the foreground explanation term $J_f(\cdot)$ as a
function of the parsing indicator $z$, we first define a function
$\phi_{f_j^t}(\cdot)$ that is parametrized by the foreground value $f_j^t$ of
pixel $j$,

$$ \phi_{f_j^t}(x) := (1 - f_j^t)\, x - f_j^t \cdot \log(1 - e^{-x}), \quad x > 0. \tag{13} $$


The introduced function $\phi_{f_j^t}(\cdot)$ is convex as we show in
the following lemma.

Lemma 4.2: The function $\phi_{f_j^t}(x)$, $x > 0$ (Eq. 13) is
convex for nonnegative values of the parameter $f_j^t$. The
proof of Lemma 4.2 is given in the Appendix.

We also introduce a joint shape prototype vector $w :=
[w_1^\top \cdots w_K^\top]^\top$ that is obtained by concatenating all individual
shape prototype vectors $w_k$, $k \in \{1, \dots, K\}$ (cf. Fig.
3). The component $w_{k,j'}$ equals the negative logarithm of
the background probability of pixel $j'$ in the normal shape
prototype $w_k$,

$$ w_{k,j'} = -\log\bigl(1 - P_k(f_{j'} = 1)\bigr). \tag{14} $$

The following lemma establishes a relationship between the
foreground explanation term $J_f(z)$, the parsing indicator $z$
and the joint shape prototype vector $w$.

Fig. 6. Values of the objective function $J(z)$ (Eq. 16)
that are obtained as part of the convex optimization
procedure that is used to solve the proposed video
parsing problem.


Lemma 4.3: The foreground explanation term $J_f(\cdot)$ is
the sum over all pixels $j$ of convex functions $\phi_{f_j^t}(\cdot)$ whose
argument is a bilinear function of the parsing indicator $z$
and the joint shape prototype $w$,

$$ J_f(z) = \sum_j \phi_{f_j^t}\bigl(z^\top C_j w + c_0\bigr). \tag{15} $$

The parameter matrices $C_j$ and scalar $c_0$ do not depend
on the parsing indicator $z$ or joint shape prototype $w$. The
proof of Lemma 4.3 is given in the Appendix.

In Lemmas 4.3 and 4.1 we expressed the foreground and
hypotheses explanation terms $J_f(\cdot)$ and $J_h(\cdot)$ as convex
functions of the parsing indicator $z$. Therefore, the video
parsing objective function $J(\cdot) := J_f(\cdot) + J_h(\cdot)$ (Eq. 10)
is a convex function of the parsing indicator $z$. To
efficiently solve the optimization problem, we relax the
parsing indicator $z$ to the positive simplex, $z_h \succeq 0$ and
$\mathbf{1}^\top z_h \le 1$, $\forall h$. The last inequality follows from Eq. 11
and the fact that $o_h \le 1$.

The MAP inference in our video parsing model is thus
equivalent to the following constrained convex optimization
problem,

$$ \operatorname*{argmin}_z \; J(z) = b^\top z + b_0 + \sum_j \phi_{f_j^t}\bigl(z^\top C_j w + c_0\bigr), \quad \text{s.t. } z_h \succeq 0 \text{ and } \mathbf{1}^\top z_h \le 1, \; \forall h. \tag{16} $$


After finding the optimal value $\hat{z}$ of the parsing indicator $z$,
we calculate the hypothesis indicator $\hat{o}_h$, and the matching
normal object prototype $\hat{m}_h$ of hypothesis $h$, as

$$ \hat{o}_h = \sum_{k=1}^{K} \hat{z}_{h,k}, \tag{17} $$

$$ \hat{m}_h = \operatorname*{argmax}_k \; \hat{z}_{h,k}. \tag{18} $$
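Reading off the activation and the prototype from a relaxed solution, as in Eqs. (17) and (18), can be sketched as follows. The dictionary layout and the function name are illustrative assumptions.

```python
def decode_parsing(z):
    """Eqs. (17)-(18): recover activation and prototype per hypothesis.

    z: dict mapping a hypothesis id to its K relaxed indicators z_{h,k}.
    Returns a dict mapping the id to (o_h, m_h), with m_h in {1, ..., K}.
    """
    decoded = {}
    for h, z_h in z.items():
        o_h = sum(z_h)                                     # Eq. (17)
        best_k = max(range(len(z_h)), key=lambda k: z_h[k])
        decoded[h] = (o_h, best_k + 1)                     # Eq. (18), 1-based
    return decoded
```

For an inactive hypothesis all indicators are zero, so the returned prototype index is arbitrary (here the first) and should be ignored.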


4.2 Solving the Convex Optimization Problem

In the previous section, we showed that the joint inference
of variables $\{o_h, m_h\}_h$ can be achieved by minimizing the
MAP objective function $J(z)$ to obtain the parsing indicator
$z$ (Eq. 16), that belongs to the Cartesian product
$\mathcal{Z} = \prod_h \mathcal{Z}_h$ of positive simplexes,

$$ \mathcal{Z}_h = \{z_h : z_h \succeq 0 \text{ and } \mathbf{1}^\top z_h \le 1\}. \tag{19} $$

The function $J(z)$ is convex, smooth and bounded on the
set $\mathcal{Z}$. The projected gradient method [40],

$$ z^{n+1} = \mathrm{Proj}_{\mathcal{Z}}\bigl(z^n - \nabla J(z^n)\bigr), \tag{20} $$

converges to the global optimum of the convex optimization
problem in Eq. 16 because of the Lipschitz continuity
of the first derivative of the function $\phi_{f_j^t}(\cdot)$ as stated in the
following lemma.

Lemma 4.4: The first derivative of the function $\phi_{f_j^t}(x)$
is $\rho$-Lipschitz continuous in argument $x \ge c_0$, i.e. there is
a constant $\rho$ such that

$$ \bigl|\phi'_{f_j^t}(x_1) - \phi'_{f_j^t}(x_2)\bigr| \le \rho\,|x_1 - x_2|, \quad \forall x_1, x_2 \ge c_0. \tag{21} $$

The proof of Lemma 4.4 is given in Appendix D.

The projection $\mathrm{Proj}_{\mathcal{Z}}(\cdot)$ requires each $z_h$ to be projected
onto the positive simplex $\mathcal{Z}_h$. The projection onto the
positive simplex is calculated by applying the method of
Duchi et al. [?]. The projected gradient method finds the
solution of the video parsing after only a few tens of iterations
(Fig. 6).
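The projected gradient iteration can be sketched in Python on a toy problem. The projection below handles the set $\{z_h \succeq 0, \mathbf{1}^\top z_h \le 1\}$ by clipping first and falling back to a sort-based simplex projection in the style of Duchi et al. when the sum constraint is violated. The fixed step size, the iteration count, and the quadratic toy objective in the usage note are assumptions for illustration and are not the objective $J(z)$ of Eq. 16.

```python
def project_capped_simplex(v):
    """Euclidean projection of v onto {z : z >= 0, sum(z) <= 1}."""
    clipped = [max(x, 0.0) for x in v]
    if sum(clipped) <= 1.0:
        return clipped          # sum constraint inactive: clipping suffices
    # Sum constraint active: sort-based projection onto {z >= 0, sum(z) = 1}.
    u = sorted(v, reverse=True)
    running_sum, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        running_sum += u_j
        candidate = (running_sum - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate   # keep the threshold of the last valid j
    return [max(x - theta, 0.0) for x in v]

def projected_gradient(gradient, z0, step=0.1, iterations=200):
    """Iterate z <- Proj(z - step * grad J(z)), block by block.

    gradient: function returning one gradient block per hypothesis.
    step, iterations: assumed values, not taken from the text.
    """
    z = [list(z_h) for z_h in z0]
    for _ in range(iterations):
        g = gradient(z)
        z = [
            project_capped_simplex([z_i - step * g_i for z_i, g_i in zip(z_h, g_h)])
            for z_h, g_h in zip(z, g)
        ]
    return z
```

As a usage example, minimizing the toy objective $\tfrac{1}{2}\|z - (0.8, 0.6)\|^2$ for a single hypothesis with `projected_gradient(lambda z: [[a - b for a, b in zip(z[0], [0.8, 0.6])]], [[0.0, 0.0]])` should approach the projection of the target onto the feasible set.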

4.3 From Inference to Abnormalities 

Video parsing analyses the foreground in a video and 
identifies objects that have atypical appearance or behave 
suspiciously, to label these as abnormal. Abnormalities can 
also be localized on the level of pixels, where it leads 
to a segmentation of regions in the video that contain 
irregular spatio-temporal patterns. Subsequently, we see 
how both the object-level and pixel-level abnormalities can 
be detected in video, based on the inference results of our 
video parsing approach. 

Object-level abnormalities. A hypothesis $h$ is an abnormal
object, $a_h = 1$, if it is indispensable for explaining the
foreground, $\hat{o}_h = 1$, but it does not have a matching normal
object prototype, i.e., the best estimate $\hat{m}_h$ of a matching
prototype is unlikely to explain the hypothesis (cf. Eq. 6),

$$ P(a_h = 1 \mid o_h = \hat{o}_h, m_h = \hat{m}_h) \propto \hat{o}_h \, P(o_h = 1 \mid d_h) \, P(m_h \ne \hat{m}_h \mid o_h = \hat{o}_h, d_h, l_h) \tag{22} $$

$$ \propto \hat{o}_h \, P(o_h = 1 \mid d_h) \bigl(1 - P(m_h = \hat{m}_h \mid d_h) \, P(l_h \mid m_h = \hat{m}_h)\bigr). \tag{23} $$
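Eq. (23) combines three quantities that are available after inference. A minimal Python sketch, assuming all inputs are already scaled to [0, 1] (the location term of Eq. 5 is only defined up to proportionality, so its normalization is an assumption here), could look as follows.

```python
def object_abnormality_score(o_hat, p_active, p_match, p_location):
    """Unnormalized abnormality score of one hypothesis, following Eq. (23).

    o_hat:      inferred activation of the hypothesis (Eq. 17)
    p_active:   P(o_h = 1 | d_h) from the inverted background classifier
    p_match:    P(m_h = m_hat | d_h), the appearance/motion match (Eq. 4)
    p_location: P(l_h | m_h = m_hat), assumed rescaled to [0, 1]
    """
    return o_hat * p_active * (1.0 - p_match * p_location)
```

The score is zero for inactive hypotheses and decreases as the best prototype explains the hypothesis better.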

Pixel-level abnormalities. Similarly, a pixel $j$ is part
of an abnormal object, $a_j^t = 1$, if it is in the foreground,
$f_j^t = 1$, and at least one of the hypotheses that extend over
this pixel, $\{h : j \in S_h^t\}$, is abnormal,

P{a] = l\fj,{ah}h:jesO 

^ fj ■ = 1) ■ = l\oh,mh). (24) 

h-.jesi 
Fig. 7. Frame-wise abnormality labeling on the UCSD
ped1 dataset. The performance measures AUC and EER
given in Tab. 1 are calculated from the ROC curves.


5 Learning an Object Model for Video Parsing

Parsing query videos for abnormality detection requires an
object model. We use training videos that contain a large
number of normal object samples but no abnormalities
to train the normal object model, which consists of prototypes
representing the normal object shape, appearance,
and motion. As ground-truth locations of objects in the
training videos are not provided, we infer them by video
parsing. However, video parsing in turn requires the
normal object prototypes. A standard approach for solving
such a problem of mutual dependencies is
expectation-maximization (EM). Given an initial estimate of the
normal object prototypes, we use them to parse the training
videos, i.e. we discover hypotheses that best explain the
foreground and are matched to the object prototypes (E-step).
Thereafter, we update the object prototypes using the
matched hypotheses (M-step). We find the object model by
iterating the EM steps until convergence.

The goal of learning is to estimate the normal object
shape prototypes $\{w_k\}_k$ and their corresponding
spatio-temporal descriptors $\{d_k\}_k$, $k \in \{1, \ldots, K\}$. The
objective function for learning is the same as for the
inference, except that it is now minimized jointly
in terms of the shape prototypes $\{w_k\}_k$, their spatio-temporal
descriptors $\{d_k\}_k$, and the parsing indicator $z$,

$$\operatorname*{argmin}_{\{d_k, w_k\}_k,\, z} J(z, \{d_k, w_k\}_k) = J_h(z, \{d_k\}_k) + J_f(z, \{w_k\}_k),$$

$$\text{s.t. } w_k \succeq 0,\ \forall k, \quad z_h \succeq 0 \text{ and } \mathbf{1}^\top z_h \leq 1,\ \forall h. \qquad (25)$$


The hypotheses explanation term $J_h(\cdot)$ is a function of
the parsing indicator $z$ and the spatio-temporal descriptors
$\{d_k\}_k$,

$$J_h(z, \{d_k\}_k) = \beta \sum_h \sum_k z_{h,k}\, \Delta(d_h, d_k) + b^\top z + b_0, \qquad (26)$$

where the parameters $b$ and $b_0$ do not depend on the parsing
indicator $z$ or the spatio-temporal descriptors $\{d_k\}_k$ (see the
proof of Lemma 4.1 in Appendix A).


From Eq. 15 we see that the foreground explanation term
$J_f(\cdot)$ depends in a convex way on both the parsing indicator
$z$ and the joint shape prototype vector $w$.

Procedure for the object prototype learning. We now
explain the EM algorithm used for solving the optimization
problem of Eq. 25.

E-step. Given the object prototypes, we parse the training
videos to infer the parsing indicator $z$, which yields
the hypothesis indicator $\hat{o}_h$ for each hypothesis $h$ and its
corresponding normal object prototype $\hat{m}_h$.

M-step. We estimate the shape prototypes $\{w_k\}_k$ and
their spatio-temporal descriptors $\{d_k\}_k$ from the results of
video parsing. As hypotheses overlap in training videos,
the corresponding shape prototypes become mutually dependent
and thus need to be learned jointly. We estimate
the joint shape prototype vector $w$ by the following convex
optimization,

$$\hat{w} = \operatorname*{argmin}_{w} J_f(z, w) = \operatorname*{argmin}_{w} \sum_j \Phi_{f_j}(w^\top C_j z + c_0). \qquad (27)$$

The convex optimization problem of Eq. 27 can be solved
efficiently by the projected gradient method that we used
for solving the MAP inference problem (Eq. 20),

$$w^{n+1} = \mathrm{Proj}_{w \succeq 0}\big(w^n - \alpha_n \nabla_w J_f(z, w^n)\big). \qquad (28)$$


The spatio-temporal descriptors $\{d_k\}_k$, $k \in \{1, \ldots, K\}$,
are estimated separately for each normal object prototype,

$$\hat{d}_k = \operatorname*{argmin}_{d_k} \sum_h z_{h,k}\, \Delta(d_h, d_k). \qquad (29)$$

In the case of a squared Euclidean distance function,
$\Delta(d_h, d_k) = \|d_h - d_k\|^2$, there is a closed-form solution
for $d_k$, given as the average of the spatio-temporal descriptors
$d_h$ of those hypotheses that are matched to prototype $k$ by
video parsing,

$$\hat{d}_k = \frac{\sum_h z_{h,k}\, d_h}{\sum_h z_{h,k}}. \qquad (30)$$
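The closed-form update of Eq. 30 is a weighted mean and can be sketched as follows. The function name, the nested-list layout of `z` and `descriptors`, and the choice to return `None` for a prototype without matched hypotheses are assumptions made for illustration only.

```python
def update_descriptors(z, descriptors, num_prototypes):
    """Weighted-mean descriptor update for each prototype (cf. Eq. 30).

    z[h][k]        : parsing indicator of hypothesis h for prototype k
    descriptors[h] : descriptor d_h as a list of floats
    Returns one descriptor per prototype, or None if nothing matched.
    """
    dim = len(descriptors[0])
    hypotheses = range(len(descriptors))
    prototypes = []
    for k in range(num_prototypes):
        weight = sum(z[h][k] for h in hypotheses)
        if weight == 0:
            prototypes.append(None)  # no hypothesis matched to prototype k
            continue
        prototypes.append(
            [sum(z[h][k] * descriptors[h][i] for h in hypotheses) / weight
             for i in range(dim)]
        )
    return prototypes
```

With hard assignments (each `z[h]` one-hot) this reduces to the plain average of the descriptors matched to each prototype.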

The EM algorithm assumes uniform location and velocity
distributions for normal object prototypes. However,
after the EM algorithm has converged, we estimate the
prototype's location and velocity distributions from the matched
object hypotheses by non-parametric Parzen windows.
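A Parzen window estimate of such a distribution can be sketched as below. The text does not state the kernel, the bandwidth, or the dimensionality, so this example assumes a one-dimensional Gaussian kernel with a hand-chosen bandwidth; `parzen_density` is a hypothetical name.

```python
import math


def parzen_density(x, samples, bandwidth):
    """Parzen window density estimate at x with a Gaussian kernel.

    samples   : observed values, e.g. one location coordinate of the
                hypotheses matched to a prototype (assumed 1-D here)
    bandwidth : kernel width, an assumed free parameter
    """
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )
```

The estimate is high near the matched samples and decays smoothly away from them, which is what allows rarely visited locations or unusual velocities to lower the likelihood of a hypothesis.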

Initialization. To start the EM algorithm, we need an initial
estimate of the normal object model. After background
subtraction, some foreground segments correspond to isolated
normal objects that can be used to initialize our object
prototypes. However, foreground/background segmentation
also produces many foreground segments that correspond
to interacting objects (doublets, triplets etc.). These segments
are more complex and can be analyzed only by
video parsing. Consequently, we need to infer which of the
training foreground segments correspond to isolated normal
objects and estimate object prototypes based upon them. We
observe that isolated normal objects create compact clusters
in the feature space. On the other hand, segments that are
mixtures of two or more objects are diverse and spread
out in the feature space. To detect isolated normal objects,
Fig. 8. (a) Frame-wise abnormality labeling for the UCSD ped2 dataset. (b) Pixel-wise abnormality prediction
evaluated on the partially annotated UCSD ped1 dataset. (c) Pixel-wise abnormality prediction evaluated
using the full annotation of the complete UCSD ped1 dataset that we have assembled. In all of these cases our
approach significantly improves upon the state-of-the-art, which can also be seen from the corresponding AUC
and RD values provided in Tab. 1 and 2.


TABLE 1
Performance measures on the UCSD ped1 dataset

                        frame-wise        pixel-wise partial  pixel-wise full
Method                  AUC (%)  EER (%)  AUC (%)  RD (%)     AUC (%)  RD (%)
Social force [43]       67.5     31       19.7     21         -        -
MPPCA [24]              59       40       20.5     18         -        -
Social force + MPPCA    67       32       21.3     28         -        -
Adam [23]               65       38       13.3     24         -        -
Sparse [29]             86       19       46.1     46         -        -
LSA [30]                92.7     16       -        -          -        -
SCL [31]                91.8     15       63.8     59.1       -        -
MDT [1]                 81.8     25       44.1     45         -        -
H-MDT CRF [26]          -        17.8     66.2     64.8       82.7     74.5
SVP [39]                91       18       75.6     68         83.6     77
STVP                    93.9     12.9     80.3     75.2       84.2     79.5


we cluster all the foreground segments and then select 
compact clusters in the feature space that correspond to 
isolated objects. We use Ward’s method for agglomerative 
clustering to minimize the variance of clusters. Normal 
object prototypes are then computed as the centers of 
compact clusters. 
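The initialization can be sketched with a small, unoptimized implementation of Ward's agglomerative clustering followed by a compactness filter. The text does not say how many clusters are formed or how "compact" is decided, so `num_clusters`, `max_variance` and `min_size` are assumed parameters, and all function names are hypothetical.

```python
def ward_clustering(points, num_clusters):
    """Greedy agglomerative clustering using Ward's merge cost."""
    clusters = [{"members": [i], "center": list(p)} for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares caused by merging a and b.
        na, nb = len(a["members"]), len(b["members"])
        dist2 = sum((x - y) ** 2 for x, y in zip(a["center"], b["center"]))
        return na * nb / (na + nb) * dist2

    while len(clusters) > num_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        center = [(na * x + nb * y) / (na + nb)
                  for x, y in zip(a["center"], b["center"])]
        merged = {"members": a["members"] + b["members"], "center": center}
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters


def cluster_variance(cluster, points):
    """Mean squared distance of the members to the cluster center."""
    total = sum(
        sum((x - c) ** 2 for x, c in zip(points[m], cluster["center"]))
        for m in cluster["members"]
    )
    return total / len(cluster["members"])


def compact_cluster_centers(points, num_clusters, max_variance, min_size=2):
    """Centers of clusters that are both populated and compact."""
    return [
        c["center"]
        for c in ward_clustering(points, num_clusters)
        if len(c["members"]) >= min_size
        and cluster_variance(c, points) <= max_variance
    ]
```

In this sketch the returned centers play the role of the initial normal object prototypes, while diffuse clusters (presumably mixtures of interacting objects) are left for video parsing to resolve.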

6 Creating Initial Object Hypotheses 

To initialize video parsing, we need a shortlist of
spatio-temporal object hypotheses h. A spatio-temporal
hypothesis h consists of a sequence of object candidates in
individual frames that are linked temporally. In this section
we explain a method for producing per-frame object candidates
and grouping them temporally based on their motion
to obtain this shortlist. Thereafter, we explain
how to fill in per-frame candidates that were missed during
temporal grouping.

Temporal grouping of per-frame object candidates. To
detect per-frame object candidates, we apply an inverted
background detector that is trained to distinguish background
patterns from everything else. The inverted background
detector is trained on background and normal foreground
segments obtained from training videos by background
subtraction. This discriminative appearance-based
classifier retains in each frame the object candidates that are
least likely to be background. Standard non-maximum
suppression (NMS) then removes some of the candidates
based on an overlap criterion. The discriminative classifier is
trained using a linear SVM [44] with the frame-wise descriptor
extracted from background/foreground segments of
training videos.
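The NMS step can be illustrated with the common greedy variant. The text only says that candidates are removed based on overlap, so the axis-aligned box representation, the intersection-over-union measure and the threshold `max_overlap` are assumptions, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, max_overlap=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it.

    Returns the indices of the retained boxes in order of decreasing score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= max_overlap for k in kept):
            kept.append(i)
    return kept
```

Here `scores` would be the classifier's confidence that a candidate is not background.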

We then employ agglomerative clustering to perform
a temporal grouping of candidates. This yields
spatio-temporal hypotheses h, which are sequences of per-frame
candidates. As usual, the clustering starts with singleton
clusters (each candidate being a cluster). Then, in each
round of the clustering, those groups of per-frame
object candidates that are most similar in
their motion and that do not share the same frames are
merged. The motion of a candidate is represented by the set
of trajectories obtained by tracking the edge points inside
the support region of the candidate. For tracking the feature
points we use optical flow vectors that are computed
beforehand. We define the similarity
of two object candidates as the ratio of the number of
feature point trajectories shared by the two candidates
to the total number of trajectories in the two candidates. As
the result of temporal grouping, we obtain a shortlist of
spatio-temporal hypotheses h.
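The motion similarity used for grouping can be written as a set ratio. In this sketch each candidate is represented by the identifiers of the trajectories passing through its support region, and the "total number of trajectories in the two candidates" is interpreted as the size of the union of both sets, which is an assumption; `trajectory_similarity` is a hypothetical name.

```python
def trajectory_similarity(tracks_a, tracks_b):
    """Shared trajectories divided by all trajectories of two candidates.

    tracks_a, tracks_b : iterables of trajectory identifiers
    """
    a, b = set(tracks_a), set(tracks_b)
    total = len(a | b)
    return len(a & b) / total if total else 0.0
```

Two candidates in different frames that are crossed by exactly the same tracked points obtain similarity 1, and candidates without any common trajectory obtain 0.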

Filling-in missing candidates by Kalman filter. The
inverted background detector used for producing object
candidates in each frame typically has a number of missed
detections. These are the frames in which none of the object
candidates is associated with a hypothesis h. We fill in the
missed object detections with the contextual help of other
per-frame candidates that belong to the same hypothesis
h. That is, the location of a missed object candidate
at time t is estimated from the available object candidate
locations at other times by a non-causal Kalman filter.

The shortlist of object hypotheses established by temporal
grouping has a high recall at the cost of low precision.
By maximizing the recall, the shortlist includes all
relevant hypotheses, while still maintaining a reasonable
total number thereof (about one hundred). Since hypotheses
are created by bottom-up grouping, there will, however, be
many spurious hypotheses that can only be eliminated by
video parsing.

Fig. 9. Rows show results on different frames of the UCSD ped1 benchmark. Column i) the initialization of the
video parsing by a shortlist of object hypotheses, column ii) hypotheses selected by video parsing with the best
matching shape prototype colored according to abnormality probability $P(a_h = 1)$, column iii) foreground pixel
abnormality probabilities $P(a_j = 1)$, column iv) results by the H-MDT CRF approach. Best viewed in color.

7 Experimental Evaluation 

We use three standard benchmark sets for
evaluating our video parsing approach and comparing its
performance to other state-of-the-art methods. We first
analyze the detection results of our approach on the UCSD
benchmark sets ped1 and ped2, then we present additional
results on the UMN benchmark set. We apply the standard
evaluation protocol of each dataset.

7.1 Evaluation on the UCSD Anomaly Datasets 

7.1.1 Datasets Description 

We use the challenging UCSD anomaly datasets ped1 and
ped2, which were recently proposed by Mahadevan et al. [1]
for measuring the performance of abnormality detection
algorithms. Both datasets consist of videos recorded in
crowded walkway scenes that also feature many challenging
abnormal instances, i.e. objects with unusual
appearance or behavior. The UCSD ped1 set contains 34
training and 36 test videos that are all 200 frames long. Due
to the low resolution of the ped1 videos, the pedestrians who
walk towards and away from the camera are only 10-25
pixels high. In the UCSD ped2 dataset there are 16 training
and 12 test videos of variable length (at most 180
frames). Pedestrians in these videos are about 30 pixels
high. Videos from both benchmark sets are very crowded,
so that objects heavily occlude one another.

Abnormalities in the UCSD datasets are not staged but
occur naturally in the scene and can be grouped into: i)
objects that do not fit the context of the scene, such
as a car on a crowded walkway, or ii) objects that look
normal but behave in an unusual way, such as people who
cycle or skateboard across the walkway or walk on the lawn.
Abnormalities from the UCSD benchmark sets also include
carts and wheelchairs. We emphasize that the training
videos consist only of normal objects and actions, so that
a model for abnormalities cannot be learned from them.
TABLE 2
Performance measures on the UCSD ped2 dataset

                        frame-wise        pixel-wise
Method                  AUC (%)  EER (%)  AUC (%)  RD (%)
Social force [43]       63       42       -        -
MPPCA [24]              77       30       -        -
Social force + MPPCA    71       36       -        -
Adam [23]               63       42       -        -
MDT [1]                 85       25       -        -
H-MDT CRF [26]          -        18.5     -        70.1
SVP [39]                92       14       -        -
STVP                    94.6     10.6     81.1     78.8


7.1.2 Evaluation Protocol

We use the standard protocol for evaluating abnormality
detection results that was proposed by Mahadevan et al. [1].
The protocol consists of frame-wise and pixel-wise criteria.
The frame-wise criterion labels a frame as abnormal if
it contains at least one abnormal object detection. The
localization accuracy of detected abnormalities is verified
by the pixel-wise criterion, which is more rigorous than the
frame-wise criterion since the detected abnormalities are
compared to a pixel-level ground-truth mask. The pixel-wise
criterion requires at least 40% of all ground-truth
abnormal pixels to be marked as abnormal in order to count
a frame as a true positive. By calculating the true positive rate
(TPR) and false positive rate (FPR) at different detection
thresholds we obtain the receiver operating characteristic
(ROC).
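The pixel-wise decision for a single abnormal frame can be sketched as follows. Representing the masks as sets of pixel coordinates and the name `frame_is_true_positive` are assumptions for illustration; only the 40% coverage rule is taken from the protocol described above.

```python
def frame_is_true_positive(detected, ground_truth, min_coverage=0.4):
    """Pixel-wise criterion for one frame with ground-truth abnormalities.

    detected, ground_truth : sets of (row, col) pixels marked abnormal
    The frame counts as a true positive if the detection covers at least
    min_coverage of the ground-truth abnormal pixels.
    """
    if not ground_truth:
        return False  # a frame without abnormal pixels cannot be a true positive
    covered = len(detected & ground_truth)
    return covered / len(ground_truth) >= min_coverage
```

Sweeping the detection threshold that produces `detected` and counting true and false positives over all frames then yields the points of the ROC curve.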

Frame-wise and pixel-wise criteria use the area under the
curve (AUC) as a performance measure calculated directly
from the corresponding ROC curve. For the frame-wise
criterion we also calculate the equal error rate (EER), i.e.
the error rate at which the false positive and false negative
rates are equal. For the pixel-wise criterion we compute the
rate of detection (RD), which is equal to 1 - EER. The pixel-wise
criterion is applied on the partially labeled UCSD ped1
dataset originally provided with the pixel-wise ground-truth
annotation. Moreover, we also provide complete pixel-wise
ground-truth annotations for the full datasets and evaluate
thereon.
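These measures can be computed from per-frame scores and binary labels as sketched below. The function names are hypothetical, the AUC uses the trapezoidal rule, and the EER is obtained by linear interpolation between neighboring ROC points, which is one common convention and not necessarily the one used in the benchmark code. Both classes are assumed to be present in `labels`.

```python
def roc_curve(scores, labels):
    """ROC points (fpr, tpr) obtained by sweeping the detection threshold.

    scores : abnormality score per frame, labels : 1 = abnormal, 0 = normal
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for idx, i in enumerate(order):
        if labels[i]:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if idx + 1 == len(order) or scores[order[idx + 1]] != scores[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
        for i in range(1, len(fpr))
    )


def equal_error_rate(fpr, tpr):
    """Error rate at which FPR equals the false negative rate 1 - TPR."""
    for i in range(1, len(fpr)):
        d0 = fpr[i - 1] - (1.0 - tpr[i - 1])
        d1 = fpr[i] - (1.0 - tpr[i])
        if d1 >= 0:
            t = -d0 / (d1 - d0)  # d0 < 0 <= d1 on the first crossing
            return fpr[i - 1] + t * (fpr[i] - fpr[i - 1])
    return 1.0
```

The rate of detection used for the pixel-wise criterion is then simply `1 - equal_error_rate(fpr, tpr)`.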

7.1.3 The Results of Evaluation 

Fig. 9 compares the abnormality localization of our video
parsing to the H-MDT CRF method [26] on UCSD ped1
test videos. The first row shows a person riding a bike in a
group of walking persons. In the second row there are three
abnormalities in the scene: a person riding a bike, and two
persons running along the walkway. The third row shows a
person skateboarding along the walkway, and the fourth row
shows an unusual object (car) in the scene. The columns
show: (i) initial hypotheses of video parsing, (ii) hypotheses
selected by video parsing, (iii) abnormality localization
results of video parsing, (iv) abnormality localization results
of the H-MDT CRF method [26]. Due to our learned normal
shape model used for explaining the foreground, we achieve
better localization of the abnormalities in videos.

Fig. 11. Analysis of the false positive instances generated
by our video parsing on the UCSD ped1 dataset.
Instances are sorted in decreasing order of their
abnormality score.



In Fig. 12 we show more examples of the video parsing
on UCSD ped1 test videos. Row 1 shows two persons
skateboarding and cycling on a very crowded walkway,
row 2 a skateboarder in a group of pedestrians, and row
3 two cyclists and a person walking across the walkway.
By comparing the first two columns one can see that most
hypotheses from the shortlist are discarded by video parsing
because they get statistically explained away.

We also quantitatively compare our video parsing approach
to the state-of-the-art methods on the challenging
UCSD ped1 and ped2 benchmarks [1]. The methods used
in our comparison are the mixture of dynamic textures
(MDT) [1], H-MDT CRF [26], social force model (SF)
[43], mixture of optical flow (MPPCA) [24], optical flow
method (Adam et al.) [23], SF+MPPCA [1], sparse
reconstruction (Sparse), local statistical aggregates (LSA) [30],
and sparse combination learning (SCL) [31]. Our previous
approach [39], which parses video frames individually, one
after another, is denoted as sequential video parsing (SVP).
We denote by STVP the full spatio-temporal video parsing
proposed in this paper.

Our study shows that video parsing outperforms all other
methods in experiments on both the UCSD ped1 and ped2
datasets. Fig. 7 shows ROC curves for the frame-wise
labeling of the UCSD ped1 set. Tab. 1 gives the performance
measures for the ped1 dataset. We see that the inclusion
of the temporal component and the improved inference
enable spatio-temporal video parsing to improve upon our
previous sequential video parsing by 2.9% in AUC and
5.1% in EER. From Tab. 1 we also see that our approach
improves upon recently proposed powerful methods such
as LSA [30] (1.2% gain in AUC and 3.1% in EER) as well
as SCL [31] (2.1% gain in AUC and EER). All ROC plots
for the pixel-wise labeling on ped1 are shown in Fig. 8
b) and c). For the partial pixel-wise labeling of ped1, the
spatio-temporal video parsing achieves an improvement of
4.7% AUC and 7.2% RD over the sequential video parsing.


































Fig. 10. Abnormality detection on the UMN dataset. (a) ROC curves for
frame-wise labeling (axes: false positive rate vs. true positive rate).
(b) Detection results of the H-MDT CRF [26] (left column) and
video parsing (right column). Our approach exhibits competitive performance,
as can also be seen from the corresponding AUC and EER statistics in Tab. 3.


TABLE 3
Performance measures on the UMN dataset

Method                   AUC (%)  EER (%)
chaotic invariants [46]  99.4     5.3
social force [43]        94.9     12.6
LSA [30]                 99.5     3.4
H-MDT CRF [26]           99.5     3.7
Sparse [29] (scene 1)    99.5     -
Sparse [29] (scene 2)    97.5     -
Sparse [29] (scene 3)    96.4     -
STVP (scene 1)           99.5     3.2
STVP (scene 2)           97.5     6.2
STVP (scene 3)           99.9     1.5



We outperform the closest competitor (H-MDT CRF [26])
by 14.1% in AUC and 10.4% in RD. For the full pixel-wise
labeling of ped1, we achieve an improvement of 2.5%
in RD over the sequential video parsing. In this case we
outperform the competing H-MDT CRF method [26] by
1.5% in AUC and 5.0% in RD.

The ROC curves for the frame-wise labeling of UCSD
ped2 are given in Fig. 8 a). The numerical results are given
in Tab. 2. We observe an improvement in performance of
spatio-temporal parsing over sequential parsing by 2.6% in
AUC and 3.4% in EER. We improve upon the best method
so far, H-MDT CRF [26], by 6.9% in EER. For the pixel-wise
labeling of the ped2 dataset, we outperform the competing
H-MDT CRF method by 8.7% RD (AUC values for
H-MDT CRF are not provided in [26]). Overall we see that
our spatio-temporal reasoning and the convex optimization
based inference yield a significant improvement over the
state-of-the-art.

Due to the temporal grouping of per-frame object candidates
(Sect. 6), spatio-temporal video parsing requires
significantly fewer hypotheses (only about a hundred for
the whole spatio-temporal domain) than sequential video
parsing [39], which needs the same number of hypotheses
for representing single frames. Since there remain fewer
hypotheses to process, spatio-temporal video parsing takes
less time to execute than sequential video parsing. Our
non-optimized Matlab implementation on a Dual-Core 2.7GHz
CPU runs at about 1 fps, whereas our previous sequential
video parsing took 5-10 secs per frame. This is on par with
the recent H-MDT CRF [26] and Sparse [30] methods, with the
notable exception of the extremely fast SCL method [31].

7.1.4 Analysis of False Detections 

To get a full understanding of the detection performance
of the proposed video parsing, we analyze the false detections
on the UCSD ped1 dataset. In Fig. 11 we see the first
225 false detections sorted in decreasing order of their
probability of abnormality. We observe several reasons for
false detections: i) In many cases, false detections appear
as a result of artifacts in the foreground segmentation. In
such cases, wrongly segmented pixels cannot be explained
by the learned shape model and thus they are classified as
abnormal. ii) The large variability of the normal human gait can
sometimes be interpreted in video parsing as abnormal (e.g.
running vs. fast walking). iii) Occasional errors in the provided
video annotation cause correctly detected abnormalities
to be counted as false (e.g. cars or running
persons in Fig. 11). iv) When the true-positive hypothesis
is missing from the shortlist due to a non-maximal recall,
the video parser can select an incorrect hypothesis as the next
best fit.


7.2 Evaluation on the UMN Anomaly dataset 


We additionally evaluate our video parsing on the UMN
dataset that is widely used for benchmarking abnormality
detection. The UMN dataset consists of three scenes in
which periods of normal activity are followed by periods
of emergency that are staged by people in the scene. In
normal cases people are walking around alone or in groups.
In emergency cases, however, people start to run away
in panic. For each scene several normal and abnormal events
happen one after another. In scenes one, two and three
there are two, six and three abnormal events, respectively.
The dataset does not provide pixel-wise ground-truth
abnormality maps, so we follow the standard protocol for
this dataset and evaluate the detection results only in a
frame-wise manner. Fig. 10 a) shows ROC curves for the
frame-wise labeling. The performance measures AUC and
EER are given in Tab. 3. For scene one, our performance
is on par with the best competing methods in terms of
AUC (99.5%) and EER (3.2%). For scene two we achieve
97.5% AUC, which is equal to the best performing method
(Sparse [29]). For scene three we achieve 99.9% AUC,
which improves upon the best competitor (Sparse [29]) by
3.5%. A qualitative comparison of our method to H-MDT
CRF [26] on two frames is shown in Fig. 10 b). We see
that our method achieves the best localization of abnormalities,
which is consistent with the findings from the earlier experiments on
UCSD ped1 and ped2.
Fig. 12. Additional results of video parsing on the UCSD ped1 dataset. Rows correspond to different examples.
The first, third and fourth column correspond to the first three columns of Fig. 9. The second column shows
hypotheses that are selected from the shortlist by video parsing. Other hypotheses are discarded by explaining
away using the selected hypotheses. Best viewed in color.


8 Conclusion 

In this paper we have framed abnormality detection as 
spatio-temporal video parsing to circumvent the ill-posed 
problem of directly searching for individual abnormal local 
image regions. We detect abnormalities by searching for a 
set of spatio-temporal object hypotheses that jointly explain 
the video foreground and which are themselves explained 
by normal training samples. In video parsing we do not 
independently detect individual hypotheses, but their joint 
layout that collectively describes the objects in the scene. 
We use MAP inference in a graphical model to effectively 
localize abnormalities in video and solve it as a convex 
optimization problem. We have evaluated our approach on 
several challenging datasets, which show that video parsing 
advances the state-of-the-art both in terms of abnormality 
classification and localization. 


Appendix A
Proof of Lemma 4.1

Proof: The hypotheses explanation term $J_h(\{o_h, m_h\}_h)$
(Eq. 10) can be written as follows,

$$J_h(\{o_h, m_h\}_h) = \sum_h \Big\{ -(1 - o_h) \log P(o_h = 0 \mid d_h) - o_h \log P(o_h = 1 \mid d_h) + o_h \cdot \log Z(d_h) + \sum_{k=1}^{K} \underbrace{o_h \cdot \mathbb{1}[m_h = k]}_{=\, z_{h,k}} \big( \beta \Delta(d_h, d_k) - \log P(l_h \mid m_h = k) \big) \Big\},$$

where $-\log P(l_h \mid m_h = k)$ collects the location and velocity
terms of prototype $k$. By replacing $o_h$ with the sum $\sum_k z_{h,k}$, we see
that the hypotheses explanation term $J_h(\{o_h, m_h\}_h)$ can
be expressed as a linear function of the parsing indicator
$z$,

$$J_h(z) = b^\top z + b_0,$$

where the parameter vector $b = \{b_{h,k}\}_{h,k}$ and the scalar $b_0$
are defined in the following way,

$$b_{h,k} = -\log P(o_h = 1 \mid d_h) + \log P(o_h = 0 \mid d_h) + \log Z(d_h) + \beta \Delta(d_h, d_k) - \log P(l_h \mid m_h = k),$$

$$b_0 = -\sum_h \log P(o_h = 0 \mid d_h),$$

and they do not depend on the parsing indicator $z$. □


Appendix B
Proof of Lemma 4.2

Proof: The second derivative of the function $\Phi_{f_j}(x)$,
$x > 0$, is given as follows,

$$\Phi''_{f_j}(x) = f_j \cdot \frac{e^{-x}}{(1 - e^{-x})^2}.$$

We see that the second derivative is positive, $\Phi''_{f_j}(x) > 0$,
if the parameter $f_j$ is positive, $f_j > 0$, so in this case
the function $\Phi_{f_j}(x)$ is strictly convex. If the parameter
$f_j$ equals zero, $f_j = 0$, the function is linear,
$\Phi_{f_j}(x) = x$, and therefore convex as well. □
Appendix C
Proof of Lemma 4.3

Proof: The foreground explanation term $J_f(\{o_h, m_h\}_h)$
depends on all hypotheses that cover pixel $j$,

$$J_f(\{o_h, m_h\}_h) = \sum_j \Big\{ -(1 - f_j) \log P(f_j = 0 \mid \{o_h, m_h, l_h\}_h) - f_j \cdot \log\big(1 - P(f_j = 0 \mid \{o_h, m_h, l_h\}_h)\big) \Big\}$$

$$= \sum_j \Phi_{f_j}\big(-\log P(f_j = 0 \mid \{o_h, m_h, l_h\}_h)\big).$$

The argument of the function $\Phi_{f_j}$ in the last equation
is bilinear in the parsing indicator $z$ (Eq. 11) and the joint
shape prototype vector $w$,

- log-P(/| = Q\{oh,mh,lh]h) 

= -log(l - Po) - y^log(l - P{fj = l\0h,mh,lh)) 

h 

= - log(l - Po) - ■ ^rrih = k]-l[j G 5^] 

h k " ^ 

= Zh,k 

■ X] ■ ij' + (4 4)"^] ■ iog-Pfe(4 = 0) 

r ' -^' 

= W~^CjZ + Co, 


where $C_j$ is a sparse matrix with the following elements,

Cjikj'-,h,k) = i[i G Si] ■ 1[1] = si ■ 1], + (4 4)^], 

and the scalar $c_0$ has the value $c_0 = -\log(1 - P_0)$.
Thus, the foreground explanation term $J_f(\{o_h, m_h\}_h)$
can be written as

$$J_f(z, w) = \sum_j \Phi_{f_j}\big(w^\top C_j z + c_0\big).$$


Appendix D
Proof of Lemma 4.4

Proof: The expression for the first derivative of the
function $\Phi_{f_j}(x)$ is

$$\Phi'_{f_j}(x) = 1 - f_j \cdot \frac{1}{1 - e^{-x}}.$$

The absolute difference of the first derivative of the function
$\Phi_{f_j}(x)$ evaluated at the points $x_1, x_2 \geq c_0 = -\log(1 - P_0)$ is
upper bounded in the following way,

$$\begin{aligned}
|\Phi'_{f_j}(x_1) - \Phi'_{f_j}(x_2)| &= f_j \left| \frac{1}{1 - e^{-x_1}} - \frac{1}{1 - e^{-x_2}} \right| \\
&= f_j\, \frac{|e^{-x_1} - e^{-x_2}|}{(1 - e^{-x_1})(1 - e^{-x_2})} \\
&\leq \frac{f_j}{P_0^2}\, |e^{-x_1} - e^{-x_2}| \\
&= \frac{f_j}{P_0^2}\, e^{-\min\{x_1, x_2\}} \big(1 - e^{-|x_1 - x_2|}\big) \\
&\leq \frac{f_j (1 - P_0)}{P_0^2} \cdot |x_1 - x_2| = \rho\, |x_1 - x_2|.
\end{aligned}$$

In the last line of the proof we used the inequality
$1 - e^{-x} \leq x$, $\forall x \geq 0$. □